In [ ]:
# Import all of the things you need to import!
In [2]:
import scipy
import sklearn
import nltk
import pandas as pd
The Congressional Record is more or less what happened in Congress every single day. Speeches and all that. A good large source of text data, maybe?
Let's pretend it's totally secret but we just got it leaked to us in a data dump, and we need to check it out. It was leaked from this page here.
In [3]:
# If you'd like to download it through the command line...
!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz
In [3]:
# And then extract it through the command line...
!tar -zxf convote_v1.1.tar.gz
You can explore the files if you'd like, but we're going to use the ones in convote_v1.1/data_stage_one/development_set/. It's a bunch of text files.
In [4]:
# glob finds files matching a certain filename pattern
import glob
# Give me all the text files
paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')
paths[:5]
Out[4]:
In [5]:
len(paths)
Out[5]:
So great, we have 702 of them. Now let's import them.
In [6]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
speeches_df = pd.DataFrame(speeches)
speeches_df.head()
Out[6]:
In class we had the texts variable. For the homework you can just use speeches_df['content'] to get the same sort of list.
Take a look at the contents of the first 5 speeches.
In [7]:
speeches_df['content'].head(5)
Out[7]:
In [50]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(max_features=100, stop_words='english')
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.stem.porter import PorterStemmer
In [51]:
X = count_vectorizer.fit_transform(speeches_df['content'])
In [52]:
X.toarray()
Out[52]:
Okay, that array is far too big to even look at. Let's get the list of features from the CountVectorizer (it already keeps only the top 100 words) and use them to label the columns of a dataframe.
In [53]:
tophundred_df = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())
In [54]:
tophundred_df
Out[54]:
In [ ]:
Now let's push all of that into a dataframe with nicely named columns.
In [ ]:
Everyone seems to start their speeches with "mr chairman" - how many speeches are there total, how many don't mention "chairman", and how many mention neither "mr" nor "chairman"?
In [55]:
mrchairman_df = pd.DataFrame([tophundred_df['mr'], tophundred_df['chairman'], tophundred_df['mr'] + tophundred_df['chairman']], index=["mr", "chairman", "mr + chairman"]).T
In [56]:
mrchairman_df
Out[56]:
In [83]:
num_speeches = len(mrchairman_df)
In [87]:
mrmention_df = mrchairman_df[(mrchairman_df['mr'] > 0)]
mr_mention = len(mrmention_df)
In [88]:
mrorchairmanmention_df = mrchairman_df[(mrchairman_df['mr + chairman'] > 0)]
mrorchair_mention = len(mrorchairmanmention_df)
In [93]:
print("There are",num_speeches,"speeches. Only", num_speeches - mr_mention, "do not mention mr and ", num_speeches - mrorchair_mention, "do not mention mr or chairman")
In [ ]:
In [95]:
tophundred_df.columns
Out[95]:
What is the index of the speech that is the most thankful, a.k.a. includes the word 'thank' the most times?
In [96]:
tophundred_df['thank'].sort_values(ascending=False).head(1) # thank is not in the top 100 words unless you remove stop words
Out[96]:
If I'm searching for china and trade, what are the top 3 speeches to read according to the CountVectorizer?
In [106]:
chinatrade_df = pd.DataFrame([tophundred_df['china'] + tophundred_df['trade']], index=["China + trade"]).T
In [110]:
chinatrade_df['China + trade'].sort_values(ascending=False).head(3)
Out[110]:
Now what if I'm using a TfidfVectorizer?
In [121]:
porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    # Strip punctuation, lowercase, and stem each word
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False, norm='l1', max_features=100)
X = tfidf_vectorizer.fit_transform(speeches_df['content'])
# Full term-frequency matrix with one column per (stemmed) feature
tfidf_df = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())
In [125]:
# Sum the term-frequency scores for the two search terms (assumes 'china' and 'trade' made the 100-feature cut)
chinatrade_tfidfpd = pd.DataFrame([tfidf_df['china'] + tfidf_df['trade']], index=["China + trade"]).T
In [128]:
# chinatrade_tfidfpd
In [126]:
chinatrade_tfidfpd['China + trade'].sort_values(ascending=False).head(3)
Out[126]:
What's the content of the speeches? Here's a way to get them:
In [129]:
# index 0 is the first speech, which was the first one imported.
paths[0]
Out[129]:
In [130]:
# Pass that into 'cat' using { } which lets you put variables in shell commands
# that way you can pass the path to cat
!cat {paths[0]}
Now search for something else! Another two terms that might show up, like elections and chaos? Whatever you think might be interesting.
In [ ]:
In [ ]:
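Here's one way you might go about it (just a sketch: 'elections' and 'chaos' are the example terms from the prompt, and either one might not have survived the 100-feature cut, so check before summing):
In [ ]:
# Sum the counts for a pair of search terms, skipping any term
# that didn't make it into the top 100 features
search_terms = ['elections', 'chaos']
available = [term for term in search_terms if term in tophundred_df.columns]
if available:
    print(tophundred_df[available].sum(axis=1).sort_values(ascending=False).head(3))
else:
    print("Neither term is in the top 100 features - try other words or raise max_features")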
Using a simple counting vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
Using a term frequency vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
Using a term frequency inverse document frequency vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
In [152]:
# Initialize a vectorizer (max_features=8 keeps only the 8 most frequent terms as features)
vectorizer = TfidfVectorizer(use_idf=True, tokenizer=stemming_tokenizer, stop_words='english', max_features=8)
X = vectorizer.fit_transform(speeches_df['content'])
In [153]:
X
Out[153]:
In [154]:
pd.DataFrame(X.toarray())
Out[154]:
In [155]:
from sklearn.cluster import KMeans
number_of_clusters = 8
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)
Out[155]:
In [156]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))
In [157]:
km.labels_
Out[157]:
In [159]:
speeches_df['content']
Out[159]:
In [160]:
results = pd.DataFrame()
results['content'] = speeches_df['content']
results['category'] = km.labels_
results
Out[160]:
In [161]:
vectorizer.get_feature_names()
Out[161]:
In [ ]:
In [ ]:
In [167]:
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
df
Out[167]:
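The prompt also asks for the same clustering with a simple counting vectorizer and with a plain term-frequency vectorizer. Here's a sketch that repeats the steps above for those two (cluster_and_report is a helper made up for this sketch, and max_features=100 is just a guess at a reasonable vocabulary size):
In [ ]:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def cluster_and_report(vec, texts, n_clusters=8, n_top_words=5):
    # Vectorize, cluster with KMeans, and print the top terms per cluster
    X = vec.fit_transform(texts)
    km = KMeans(n_clusters=n_clusters)
    km.fit(X)
    terms = vec.get_feature_names()
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    for i in range(n_clusters):
        top_words = [terms[ind] for ind in order_centroids[i, :n_top_words]]
        print("Cluster {}: {}".format(i, ' '.join(top_words)))
    return km

# 1. Simple counting vectorizer
count_km = cluster_and_report(CountVectorizer(stop_words='english', max_features=100), speeches_df['content'])

# 2. Term frequency vectorizer (tf only: use_idf=False)
tf_km = cluster_and_report(TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=False, norm='l1', max_features=100), speeches_df['content'])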
In [ ]:
Which one do you think works the best?
In [ ]:
In [ ]:
In [ ]:
I have a scraped collection of Harry Potter fanfiction at https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip.
I want you to read them in, vectorize them and cluster them. Use this process to find out the two types of Harry Potter fanfiction. What is your hypothesis?
In [29]:
!curl -LO https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip
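One possible way to start (a sketch, not a worked answer: I haven't looked inside hp.zip, so the file layout, the glob pattern, and the max_features=1000 choice are all assumptions to adjust once the archive is extracted):
In [ ]:
import glob
import os
import zipfile

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Extract the archive and read every file it contains
with zipfile.ZipFile('hp.zip') as zf:
    zf.extractall('hp')

hp_paths = [p for p in glob.glob('hp/**/*', recursive=True) if os.path.isfile(p)]

stories = []
for path in hp_paths:
    with open(path, encoding='utf-8', errors='ignore') as f:
        stories.append(f.read())

# Vectorize and split into two clusters, since we're after two types of fanfiction
hp_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
hp_X = hp_vectorizer.fit_transform(stories)
hp_km = KMeans(n_clusters=2)
hp_km.fit(hp_X)

# Top terms per cluster, same trick as with the speeches
hp_terms = hp_vectorizer.get_feature_names()
order_centroids = hp_km.cluster_centers_.argsort()[:, ::-1]
for i in range(2):
    print("Cluster {}: {}".format(i, ' '.join(hp_terms[ind] for ind in order_centroids[i, :10])))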
In [ ]:
In [ ]: